53 research outputs found
Semi-parametric analysis of multi-rater data
Datasets that are subjectively labeled by a number of experts are becoming more common in tasks such as biological text annotation, where class definitions are necessarily somewhat subjective. Standard classification and regression models are not suited to multiple labels, and typically a pre-processing step (normally assigning the majority class) is performed. We propose Bayesian models for classification and ordinal regression that naturally incorporate multiple expert opinions in defining predictive distributions. The models make use of Gaussian process priors, resulting in great flexibility and particular suitability to text-based problems where the number of covariates can be far greater than the number of data instances. We show that using all labels rather than just the majority improves performance on a recent biological dataset.
RELPRON: A Relative Clause Evaluation Data Set for Compositional Distributional Semantics
This article introduces RELPRON, a large data set of subject and object relative clauses, for the evaluation of methods in compositional distributional semantics. RELPRON targets an intermediate level of grammatical complexity between content-word pairs and full sentences. The task involves matching terms, such as “wisdom,” with representative properties, such as “quality that experience teaches.” A unique feature of RELPRON is that it is built from attested properties, but without the need for them to appear in relative clause format in the source corpus. The article also presents some initial experiments on RELPRON, using a variety of composition methods including simple baselines, arithmetic operators on vectors, and finally, more complex methods in which argument-taking words are represented as tensors. The latter methods are based on the Categorial framework, which is described in detail. The results show that vector addition is difficult to beat, in line with the existing literature, but that an implementation of the Categorial framework based on the Practical Lexical Function model is able to match the performance of vector addition. The article finishes with an in-depth analysis of RELPRON, showing how results vary across subject and object relative clauses, across different head nouns, and how the methods perform on the subtasks necessary for capturing relative clause semantics, as well as providing a qualitative analysis highlighting some of the more common errors. Our hope is that the competitive results presented here, in which the best systems are on average ranking one out of every two properties correctly for a given term, will inspire new approaches to the RELPRON ranking task and other tasks based on linguistically interesting constructions.
Laura Rimell and Stephen Clark were supported by EPSRC grant EP/I037512/1. Jean Maillard is supported by an EPSRC Doctoral Training Grant and a St. John’s Scholarship. Laura Rimell, Tamara Polajnar, and Stephen Clark are supported by ERC Starting Grant DisCoTex (306920).
Evolving Gaussian Process Kernels for Translation Editing Effort Estimation
In many Natural Language Processing problems the combination of machine learning and optimization techniques is essential. One of these problems is estimating the effort required to improve, under direct human supervision, a text that has been translated using a machine translation method. Recent developments in this area have shown that Gaussian Processes can be accurate for post-editing effort prediction. However, the Gaussian Process kernel has to be chosen in advance, and this choice influences the quality of the prediction. In this paper, we propose a Genetic Programming algorithm to evolve kernels for Gaussian Processes. We show that the combination of evolutionary optimization and Gaussian Processes removes the need for a-priori specification of the kernel choice, and achieves predictions that, in many cases, outperform those obtained with fixed kernels. TIN2016-78365-
Semi-supervised Prediction of Protein Interaction Sentences Exploiting Semantically Encoded Metrics.
Protein interaction detection in sentences via Gaussian processes: a preliminary evaluation
Classification methods are vital for efficient access to knowledge hidden in biomedical publications. Support vector machines (SVMs) are modern non-parametric deterministic classifiers that produce state-of-the-art performance in text mining, and across other disciplines, while reducing the need for feature engineering. In this paper we offer a much-needed evaluation of the Gaussian Process (GP) classifier, as a non-parametric probabilistic analogue to SVMs, which has rarely been applied to text classification. To this end, we provide an extensive experimental comparison of the performance and properties of these competing classifiers on the challenging problem of protein interaction detection in biomedical publications. Our results show that GPs can match the performance of SVMs without the need for costly margin parameter tuning, whilst offering the advantage of an extendable probabilistic framework for text classification.
Finding and filtering information for children
Children face several challenges when using information access systems. These include formulating queries, judging the relevance of documents, and focusing attention on interface cues, such as query suggestions, while typing queries. It has also been shown that children want a personalised Web experience and prefer content that matches their long-term entertainment and education needs. To this end, we have developed an interaction-based information filtering system to address these challenges.